16 ◾ Bioinformatics
Use “gzip -d” to decompress a compressed file.
gzip -d SRR030834.fastq.gz
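Note that “gzip -d” replaces the compressed file with the decompressed one. If you would rather keep the compressed copy as well, one option is to write the decompressed data to standard output and redirect it to a file. The following is a minimal sketch on a small made-up file; the name “keep_demo.fastq” is hypothetical and is not part of the SRR030834 data.

```shell
# Create and compress a tiny FASTQ file for illustration (made-up read).
printf '@read1\nACGT\n+\nIIII\n' > keep_demo.fastq
gzip -f keep_demo.fastq                        # leaves only keep_demo.fastq.gz

# -c writes the decompressed data to standard output, so the .gz file is kept.
gzip -dc keep_demo.fastq.gz > keep_demo.fastq
ls keep_demo.fastq keep_demo.fastq.gz
```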
If you need to know the number of records in a FASTQ file, you can use a combination of
“cat” or “zcat” and “wc -l”, which counts the number of lines in a text file. Remember that
a record in a FASTQ file has 4 lines, so the number of records is the line count divided by 4. We can use the Unix pipe symbol “|” to transfer the
output of the “cat” command to the “wc -l” command. The following command line will
count the number of records stored in the FASTQ file:
cat SRR030834.fastq | echo $((`wc -l`/4))
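When the FASTQ file is still compressed, the same count can be obtained without writing the decompressed file to disk. The following is a minimal sketch that builds a small, made-up two-record file named “sample.fastq.gz” (a hypothetical name) and counts its records. Here “gzip -dc” is used as an equivalent of “zcat”, on the assumption that it behaves the same way on your system.

```shell
# Create a tiny two-record FASTQ file for illustration (made-up reads).
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > sample.fastq
gzip -f sample.fastq           # produces sample.fastq.gz

# Decompress to standard output, count the lines, and divide by 4.
gzip -dc sample.fastq.gz | wc -l | awk '{print $1 / 4}'
```

For the real data, the last command would be run on “SRR030834.fastq.gz” instead of the sample file.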
If we need to display the file name and read count for multiple files, with the “.fastq” file
name extension, in a directory, we can use the following script:
for filename in *.fastq;
do
echo -e "$filename\t$(cat "$filename" | wc -l | awk '{print $1 / 4}')"
done
To display a FASTQ file in a tabular format, you can use the “cat” command and then use
the Unix pipe to transfer the output to the “paste” command, which converts the four lines
of the FASTQ records into tabular format.
cat SRR030834.fastq | paste - - - - > SRR030834_tab.txt
The command will store the tabular output in a new file, “SRR030834_tab.txt”. You can
open this file in any spreadsheet, or you can display it as follows (the “-S” option keeps each long record on one line instead of wrapping it):
less -S SRR030834_tab.txt
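To see what “paste - - - -” does, it may help to try it on a very small file first. The sketch below uses two made-up reads in a hypothetical file, “demo.fastq”. Each “-” tells “paste” to take one line from standard input, so four of them join every four consecutive lines into one tab-separated row.

```shell
# Build a tiny two-record FASTQ file (made-up reads, hypothetical file name).
printf '@read1 len=4\nACGT\n+\nIIII\n@read2 len=4\nTTGA\n+\nFFFF\n' > demo.fastq

# Join each group of four lines into a single tab-separated row.
cat demo.fastq | paste - - - - > demo_tab.txt

# The tabular file now has one row per record.
wc -l < demo_tab.txt

# cut -f selects tab-separated columns; column 2 holds the sequences.
cut -f 2 demo_tab.txt
```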
Creating a tabular file from a FASTQ file will help us to perform several operations such as
sorting of the entries, filtering out the duplicate reads, extracting read IDs, sequences, or
quality scores, and creating a FASTA file. We expect that the format of the identifier lines
of a FASTQ file is consistent. If you display “SRR030834_tab.txt”, you will notice that some
of the identifier line fields are separated by spaces, and if we treat any whitespace (spaces as well as tabs) as a column
separator, the IDs will be in the first column and the sequence will be in the fourth column.
However, this column order may be different in tabular files extracted from other FASTQ
files. Assume that we wish to extract only the IDs and sequences from “SRR030834_tab.txt”
into a separate text file; then we can use the “awk” command as follows:
awk '{print $1 "\t" $4}' SRR030834_tab.txt > SRR030834_seq.txt
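Because the position of the sequence column depends on how many space-separated fields the identifier line contains, a possible alternative is to tell “awk” to split on tabs only. The whole identifier line is then always column 1 and the sequence is always column 2. The sketch below illustrates this on made-up reads whose identifier lines have different numbers of fields, and it also writes the same two columns as a FASTA file. The file names “example_tab.txt”, “example_seq.txt”, and “example.fasta” are hypothetical.

```shell
# Tiny two-record tabular file (made-up reads; identifier lines differ in length).
printf '@read1 len=4\nACGT\n+\nIIII\n@read2 extra field len=4\nTTGA\n+\nFFFF\n' \
  | paste - - - - > example_tab.txt

# Split on tabs only: $1 is the full identifier line, $2 is the sequence.
# split() then keeps just the part of the identifier before the first space.
awk -F '\t' '{split($1, id, " "); print id[1] "\t" $2}' example_tab.txt > example_seq.txt

# Same columns written as FASTA: ">" replaces the leading "@" of the ID.
awk -F '\t' '{split($1, id, " "); print ">" substr(id[1], 2) "\n" $2}' example_tab.txt > example.fasta
cat example.fasta
```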